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Abstract 

We examine what is an efficient and scalable nonlinear solver, with low work 
and memory complexity, for many classes of discretized partial differential equations 
(PDEs) - matrix-free Full multigrid (FMG) with a Full Approximation Storage (FAS) 
- in the context of current trends in computer architectures. Brandt proposed an ex- 
tremely low memory FMG-FAS algorithm over 25 years ago that has several attractive 
properties for reducing costs on modern - memory centric - machines and has not 
been developed to our knowledge. This method, segmental refinement (SR), has very 
low memory requirements because the finest grids need not be held in memory at any 
one time but can be "swept" through, computing coarse grid correction and any quan- 
tities of interest, allowing for orders of magnitude reduction in memory usage. This 
algorithm has two useful ideas for effectively exploiting future architectures; improved 
data locality and reuse via "vertical" processing of the multigrid algorithms and the 
method of r-corrections, which allows for not storing the entire fine grids at any one 
time. This report develops this algorithm for a model problem and a parallel general- 
ization of the original sweeping technique. We show that FMG-FAS-SR can work as 
originally predicted, solving systems accurately enough to maintain the convergence 
rate of the discretization with one FMG iteration, and that the parallel algorithm 
provides a natural approach to fully exploiting the available parallelism of FMG. 

1 Motivation 

Current trends in computer architecture such as non-rising, or even falling, clock rates, 
saturated processor architectures in terms of pipelining, etc., the continued increase in 
the number of transistors on a chip, requires that algorithms be highly concurrent and 
asynchronous. Additionally, we have reached a point where the exponential growth in 
power cost, which goes along with this continued growth of extreme-scale machines, is 
becoming the prohibitive cost of extreme-scale computing. The desire to continue the 
exponential growth of extreme-scale PDE simulations combined with an economic need to 
keep the power budget of a machine down to say 25MW will tax the resources of computer 
engineers and may require that we develop algorithms for radically different machine models 
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with respect to memory, energy and faults than what we have worked with in the past. 
These changes, along with the continued need for mathematically scalable algorithms, as 
we increase the fidelity of our simulations, is leading to the need to rethink our solver 
algorithms for large scale PDE simulations. In particular, the powering and moving of 
memory will become more central to the cost of PDE solves and the flop counts will 
become less so. This paper aims to address the root cause of these future costs - memory 

- by developing low memory and memory movement PDE solver algorithms that exploit 
the mathematical structure of PDEs. 

To rationally develop an algorithm, and certainly to analyze an algorithm, one needs 
a machine model. Traditional complexity theory, essentially counting operations or flops, 
has served this purpose well for high performance computing - it along with its extensions 
to parallel complexity - has a well developed theory. Memory complexity is also useful and 
to some extent serves as a proxy for memory movement complexity. While data locality, 
to reduce memory traffic in the memory hierarchy, has been central to high performance 
computing for decades it is difficult to incorporate memory movement into complexity 
models directly and there is no consensus on any one approach though much work has 
been done in this area O [Sj |l2l |4l [6] . The dearth of good cost models for future machines, 
whose design is an active area of research and far from well understood, leads us to look at 
the fundamental source of costs - memory - and place less emphasis on what has historically 
been the primary measure of costs - flops. 

Multigrid is an efficient solver method for many classes of problems; matrix-free Full 
multigrid (FMG), with a Full Approximation Storage (FAS) for nonlinear problems, is an 
very efficient algorithm for some classes of problems [15], with very low memory and work 
complexity. Brandt proposed an extremely low memory FMG algorithm in his seminal 
1984 report - segmental refinement (SR)(|8j §8.7) - that does not require that all of the 
data be stored at any one time. This is done by reformulating the multigrid algorithms 
from the view of the coarse grid, using the method of r-corrections, so that in effect of 
the fine grid is stored (compressed) on the coarse grid and can be recovered with a special 
multigrid smoothing technique ^3.11 Though the algorithm was originally proposed as a 
low memory serial method that "sweeps" across the grid, using Gauss-Seidel (multiplica- 
tive) smoothers, we propose a parallel variant that uses "patches" and Jacobi (additive) 
smoothers [T3|. This approach in effect allows for low memory to be traded for concurrency 

- the coarse grids are stored explicitly and behave like standard FMG, while the finer grids 
do not explicitly store the entire domain at any one time. This results in higher flop to 
memory ratios than traditional multigrid methods because at some level of granularity, 
say a uniform memory access partition, the same memory is used repeatedly for many 
patches and thus many flops. Additionally, this algorithm requires that the multigrid al- 
gorithm be processed "vertically" , which maximizes data reuse and locality, as opposed to 
the traditional "horizontal" implementation approach where entire grids are processed se- 
quentially (ie, entire grids are first smoothed, then residuals are calculated, then restricted 
to coarse grids, etc. §2.ip . This algorithm is also inherently asynchronous and most nat- 
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urally expressed with an asynchronous, task oriented programming style. This algorithm 
has not been developed because it requires more flops than the traditional approach and is 
more complex to engineer, but it has many attractive properties on more memory centric 
computers. 

This paper proceeds by providing some basic multigrid background in ^ the segmental 
refinement algorithm is developed in ^ along with a parallel FMG-FAS-SR algorithm. We 
apply this algorithm to a model problem in fQ] and conclude in f|5j 

2 Multigrid Background 

Multigrid is an effective method for solving systems of algebraic equations that arise from 
discretized PDEs. Modern multigrid's antecedents can be traced back to Southwell in the 
1930s [14], and Fedorenko in the early 1960s [9j. Brandt developed multigrid's modern 
form in the 1970s - algorithms and analysis with work complexities equivalent to a few 
residual calculations (work units), applied to complex domains, non-constant coefficients 
problems and nonlinear problems [7]. A substantial body of literature, both theoretical 
and experimental exists that proves and demonstrates the optimality of multigrid, having 
0{n) work complexity and 0(log(n)) parallel work complexity or computational depth for 
the Laplacian [15]. Multigrid has been applied to a wide range of problems [HITS], starting 
with flow problems in the seminal paper by Brandt [7j. Multigrid as also been found to 
useful as a nonlinear solver - used directly on the nonlinear system - with demonstrated 
costs very similar to that of a linear multigrid solve ([71 [l5] §5.3.3). 

2.1 Multigrid V-cycle 

Multigrid methods are motivated by the observation that a low resolution discretization of 
an operator can capture modes or components of the error that are expensive to compute 
directly on a highly resolved discretization. More generally, any poorly locally-determined 
solution component has the potential to be resolved with a coarser representation. This 
process can be applied recursively with a series of coarse grids, thereby requiring that each 
grid resolve only the components of the error that it can solve efficiently. This process is 
known as a — cycle because of the shape of the graph in the standard representation 
of these algorithms (see Figure [4]). These coarse grids have fewer grid points, typically 
about a factor of two in each dimension, such that the total amount of work in multigrid 
iterations can be expressed as a geometric sum that converges to a small factor of the work 
on the finest mesh. These concepts can be applied to problems with particles/atoms or 
pixels as well as the traditional grid or cell variables considered here. Multigrid provides a 
basic framework within which particular multigrid methods can be developed for particular 
problems. This framework has proven to be an effective way to separate the near-field 
from the far-field contributions to the solution of say elliptic operators - the coarse grid 



3 



captures the far-field contribution and the near-field is resolved with a local process called 
a smoother. 

The coarse grid space can be represented algebraically as the columns of the prolonga- 
tion operator where h is the fine grid mesh spacing, H is the coarse grid mesh spacing. 
The prolongation operator is used to map corrections to the solution from the coarse grid 
to the fine grid. Residuals are mapped from the fine grid to the coarse grid with the re- 
striction operator is often equal to the transpose of /j^. The coarse grid matrix can 
be formed in one of two ways, either algebraically to form Galerkin (or variational) coarse 
grids {Lh ij^ L^I^) or, by creating a new operator on each coarse grid (if an explicit 
coarse grid is available). 

2.2 Nonlinear multigrid 

The multigrid V~cycle can be adapted to a nonlinear method by observing that the coarse 
grid residual equation can be written as 

ru = Lh{uh) - Lh{uh) = Lh{uh + en) - Lh{uh), (1) 

where u is the exact solution, approximates if^u^, the full intended solution represented 
on the coarse grid, hence the name "Full Approximation Scheme", and e is the error. With 
this, and an approximate solution on the fine grid Uh, the coarse grid equation can be 
written as 

Lh {ihUh + en) = Lh {iJ^Uh) + /f {fh - LhUh) , (2) 

and is solved approximately. After I^Uh is subtracted from the I^Uh + en term the 
correction is applied to the fine grid with the standard prolongation process. This method 
is called Full Approximation Scheme (or Full Approximation Storage - FAS), because the 
full solution is stored on each level and not just a residual correction. See Trottenberg for 
more details {15] . 

Figure [1] shows the FAS multigrid V(vl,u2)- cycle and uses a nonlinear smoother u ^ 
S{L,u,f). 

With M coarse grids the preconditioner (solver) for Lmum = Jm is u = FAS{Lm,^-: fhi) 

2.3 Full Multigrid 

An important variant on the V -cycle is the F-cycle or related full multigrid (FMG). The 
multigrid F-cycle restricts the right hand side from the fine grid to the coarsest grid and 
then applies a multigrid cycle, of some sort, at each level, starting with the coarsest level 
and interpolating the solution to the next finest level as an initial solution for the next 
V-cycle. A higher order interpolator between the level solves, II^^, is needed for optimal 
efficiency of the FMG process but requires more data movement. An attractive property of 



4 



u =function FAS{Lk,Uk, fk) 



if /c > 



Uk ^ S''^{Lk,Uk, fk) 



vl iterations of the (pre) smoother 



rk^ fk- LkUk 



- restriction of residual to coarse grid 

- restriction of solution to coarse grid 

1, rfc_i + Lk-iUk-i) - recursive application 
Uk~i) - prolongation of coarse grid correction 

- v2 iterations of the (post) smoother 



Ck-i ^ FAS{Lk-i,Uk 

Uk^ Uk + /^_i(Cfc_l - 

Uk ^ S''^{Lk,Uk,fk) 



else 

Uk ^ L^^fo 
return Uk 



exact solve of coarsest grid 



Figure 1: FAS Multigrid V -cycle Algorithm 



the F~cycle is that for some operators it has been proven that one F-cycle is sufficient to 
reduce the error to the order of the truncation error, which is often all that is required [5] 
( [15] §3.2.2). Thus, the algebraic system can be solved to spatial truncation error accuracy 
with a work complexity of a few work units, or residual calculations. Note, the parallel 
complexity of an F~cycle does have an extra log{n) factor. 

One can analyze the F-cycle with induction where the induction hypothesis is that the 
algebraic error is some factor r of the truncation error (which is satisfied on the coarsest 
grid where an accurate solver is required) , and the standard assumption that the truncation 
error is of the form 0{W), and that the solver on each level (eg, one V -cycle) reduces the 
error by some factor P (which can be proven or measured experimentally) to derive an 
equation that directly relates r to P. This allows the use of the desired ratio - any desired 
ratio - of solution to truncation error to tune the solver at each level - see Adams for 
the application of these ideas to compressible resistive magnetohydro dynamics where two 
V -cycle were used as the level solver pj. 

FMG starts with the coarse grid, and is more natural in an AMR context; it simply 
omits the initial restriction of the residual to the coarse grid. Figure [2] shows the FMG 
algorithm. 

FMG-FAS multigrid is an efficient solver for some classes of problems and its application 
to new classes of problems is an active area of research [IJ. This method was developed in 
the 1970s and was attractive because of its low memory requirements: only requiring the 
field variables themselves, and because of its very low work complexity (as low as six work 
units to solve to truncation error). After the profligate era of the 1980s to the 2000s, with 
large amounts of uniform access memory available, low memory complexity algorithms 
are attractive again as we move to memory centric cost models. Thus, FMG-FAS is an 
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u =function FMG 

uo ^ FAS (Lo, 0, /o) - exact solve of coarsest grid 
for k=l:M 

Uk ^ n^_]^nfc_i - FMG prolongation 

Uk ^ FAS {Lk,Uk, fk) - V-cycle 
return um 

Figure 2: FMG-FAS algorithm 

attractive solver algorithm for the anticipated machine models for exa-scale machines. 
2.4 Segmental Refinement and PDE Compression 

Looking at FAS from a two grid point of view we can rewrite the coarse grid Eq. [2] as 

where is some fine-to-coarse transfer which need not be the same as (they are 
in principle defined on different spaces), and is the current solution on the fine grid. 
Having obtained an approximate solution from solving Eq. Owe can write the fine grid 
correction as 

^%EW = u^ + 4 (u"" - il^n') . (4) 



Or use the solution directly: 

Generally Eq. [5]is not preferred because it introduces interpolation error in the full solution, 
and not just a correction, but it will be useful in the context of this work. 

We can look at the FMG method from the dual point of view, that is from the view 
of the coarse grid. Instead of looking at the coarse grid as an accelerator to the fine grid 
convergence we look at a fine grid as a correction to the coarse grid problem. Eq. [3] can 
be rewritten in the form: 

L u = f , (6) 

where the r-correction is 

-H _ tH ( fH~h\ tH ( Th~h 



T^=L-[Il^n-)-I{^[L-u-), (7) 

and = I}^ ■ At convergence = ij^u^ , hence is the fine-to- coarse defect correc- 
tion designed to make its solution coincide with the fine-grid solution. This observation, 
along with the update of Eq. [5] allows for an FMG-FAS algorithm that need not store the 
fine grids, but can compute them locally patch-by-patch. Brandt proposed the segmental 
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refinement method that exploits this property by "sweeping" through the grid refining one 
segment at a time (§8.7 

Note, with the r-correction, the coarse grid solution is equal to the fine grid solution 
at the coarse grid points - this allows for the inexpensive computation of the solution 
with a special relaxation method in the post smoothing leg of the V-cycle ( ^3.ip . Thus, 
this representation can be viewed as a compression technique that exploits the PDE and 
multigrid method - PDE compression. 

3 Algorithm 

The duel view of FMG allows the r-corrections to be computed on subdomains and the fine 
grid data need not be retained in memory. In serial this allows only small parts of the fine 
grids to be stored at any given time as the algorithm "sweeps" through the grids, computing 
the T-corrections and restricting them to the coarse grid. This algorithm also has a high 
degree of concurrency - the low memory properties of the algorithm can be "traded" for 
concurrency. Exploiting these observations requires looking at the data dependencies of 
the FMG algorithm. 

To fully exploit the available parallelism in the FMG algorithm we generalize the sweep- 
ing process of the original algorithm by defining a regular "patch" i of cells u^, on grid k, 
with say 4-64 cells on a side. Define a partitioning of each grid into a non-overlapping set 
of patches G^, and an extension of a patch Uj by some number of cells as ui - these are 
halo or buffer cells and they allow for subdomain solves with inaccurate boundary condi- 
tions to be solved accurately in the region of interest Ui without communication. These 
extended patches are conceptually similar to buffer regions that are used in algorithms to 
reduce the number of messages at the expense of sending more data and redundant work 
Define a solver or smoother S on an extended patch with non-homogenous boundary 
conditions that returns an improved solution on the original - non-extended - patch of 
data. This smoother is used as the coarse grid solver for notational convenience and it 
must be accurate when used on the (entire) coarse grid. This smoother takes two extra 
arguments - Pi_^_^ and u- ~ for use in a Kaczmarz smoother ( ^3.1|) . The smoother assembles 
these patch or "block" solve solutions additively, in a block Jacobi method, to increase the 
degree of parallelism over the multiplicative method that is natural in serial. Assume that 
the coarsest grid, grid u", is composed of only one patch, again for notational convenience, 
and that each subsequent grid is a simple refinement by a small integer refinement ratio 
(ie, two or four). The size of the group of patches on each level is a factor of eight times 
larger than the next coarser level in 3D in a non-AMR solve with a refinement ratio of two 
(or 64 with a refinement ratio of 4). An AMR solve, with nested constant size patches, 
would pruned these groups appropriately. Figure [3] shows a sketch of a parallel segmented 
refinement algorithm assuming M coarse grids and the forcing function / has been suitably 
interpolated to, or defined on, all levels. 
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function FMG-FAS-SR 

for /c = : M - 1 
for all uf e 



^ n^i+^u'' - FMG prolongation 



) 

"^i ^ -^fc+i^f^^ ~ restrict solution 

^ L'' (ik+iUi^^) - It+i (l''+^u^^^) - data dependence for u''+^ 



fk , jk f _|_ _fc 

for all n*^ G 



for / = A; : — 1 : 1 - pre-smoothing leg of V-cycle 

for all ul £ G' 

^ i\~^u\ - restrict solution 



cl-1 , tI-1 fl I Z-1 



for I = : k - post-smoothing leg of V-cycle 



if / = M - 1 



for all ul G G' 



else 



u''j^^ ^ n'^^ti' - using Eq. [5]& HO prolongation 

v^'^s{L^+\vt\ft\lLv^ -useCR 
compute functional of n-^^ - fine grid 

^ -"^^ + -^i"^^ - - using Eq. 1 



/+1 , c / r/+l 



S L^^\v^\ft 



Figure 3: Segmented refinement FMG-FAS-SR algorithm (lines labeled 1-5 must be fused 
to avoid the need to store u'^) 



The dependency graph of this algorithm is similar to a forest of oct-trees with additional 
dependencies between neighboring trees. If simple averaging is used in restriction then 
processing a coarse grid patch depends on the child patches (eg, an oct-tree), with 
a refinement ratio of in Z? dimensions. Higher order interpolation, which we use for 
prolongation, adds edges in the data dependency graph between the trees. 

Figure S] shows the FMG cycle, the r-corrections and the fine grid processing that 
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can be fused and processed without permanent storage. Note, that we use the higher 




Figure 4: FMG cycle with r corrections; dashed boxes show fused matrix free processing 



order interpolation on the grids with full updates (SR levels), the finest level only in this 
figure. This algorithm posses a high degree of concurrency, with, for instance, ten levels of 
refinement resulting in over one billion way parallelism in 3D and R = 2. 



3.1 Compatible Relaxation and Kaczmarz Smoother 

The critical change that we have made to the mathematical algorithm, to avoid storage of 
the finest grids, is the use of Eq. [5] to update the solution on the finest grids. This method 
of not using a residual correction form, of using a full update, has the disadvantage that it 
adds coarse grid interpolation error to the whole fine grid solution instead of only to a cor- 
rection. We can ameliorate this problem by using stronger smoothers and using compatible 
relaxation (CR). CR uses a distributive relaxation or Kaczmarz relaxation in combination 
with a standard point-wise soother like Gauss-Seidel [101 E] • Note, extra smoothing steps 
may be required, using extra flops, but because we have taken care to insure good data 
locality no additional memory movement is required, which is acceptable in the machine 
model that we are optimizing for. We wish to maintain the approximation properties of 
the coarse grid while allowing smoothing of the error on the fine grid. One approach for 
maintaining the approximation properties of the coarse grid is to (approximately) constrain 
the fine grid solution to solve 

= If ^^ (8) 

Figure [5] shows the CR algorithm for our smoother on full update levels, that alternates 
between a standard smoother and a Kaczmarz relaxation. 



3.2 The Solution 

A challenge of not explicitly storing the solution is the obvious problem of getting desired 
data from the simulation. There are two basic methods for computing quantities of interest 
in the segmented refinement approach: 1) collect a functional of the data as the solution is 
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function S{L^, u^, f^,ll^,u") 



for all j in patch p 



r - ll^{j,:)u'^ 

t^r/P{j,j) 



- residual 

- scalar correction 
update (distributive) 



standard smoother on patch u' 



h 



Figure 5: Kacmarz smoother on a patch 



computed, including streaming the entire solution to a "file" for later processing (certainly 
useful for small simulations and debugging) and 2) storing a coarse grid solution which can 
be expanded or uncompressed efficiently with local processing (ie, PDE decompression) on 
demand for analysis. 

4 Numerical Studies 

We investigate the properties of the algorithms developed here with a ID Laplacian with 
homogenous Dirichlet boundary conditions and constant material coefficients. We use a 
second order finite volume discretization, second order multigrid prolongation, fourth order 
FMG interpolation (11) and first order accurate restriction operators. The experiments are 
run in Matlab. The Matlab source code is listed in f|6l We do not use a data driven 
(vertical) processing that is a potential result of the algorithm in Fig. [3l but our simple 
(horizontal) processing of the algorithm does have the same semantics as our proposed 
algorithm. The smoother does simulate the asynchronous algorithm in that it is additive, 
a block Jacobi method, and so it is invariant to the order of processing of the blocks. Each 
block has two non-overlapped cells, whose result is returned by the smoother, and two 
or four halo cells on each side (except at boundaries of course). The solver within each 
subdomain is a few iterations of Gauss-Seidel, or compatible relaxation on the SR (full 
update) levels. 

To ascertain the costs of the proposed algorithm we conduct convergence studies, plot- 
ting the differential error \u — u\2 as a function of the number of cells. The discretization 
method is second order accurate and so we wish to maintain second order accuracy in 
our approximate solution with one FMG cycle. We consider on 0-3 levels of SR in each 
study and look at the number of halo cells (two and four) and the number of Gauss-Seidel 
iterations (one and two) in the subdomain solver of the Jacobi smoother. One application 
of the outer Jacobi smoother is used at all times. 

Figure El shows convergence studies for the two halo smoother subdomains. This data 
shows that truncation error accuracy (of the fine grid) is lost to some extent with three SR 



10 



Error of FMG vs. FMG-SR, w/ 2 halo cells & V(l,l) cj Error of FMG vs. FMG-SR, w/ 2 halo cells & V(2,2) cj 
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Figure 6: Convergence study with two halo cells in subdomains solver; V(l,l) cycle (left); 
V(2,2) cycle (right) 



levels and V{1, 1) cycles, but otherwise we observe good second order convergence. 

Figure [7] shows convergence studies for the four halo smoother subdomains. This data 



Error of FMG vs. FMG-SR, w/ 4 halo cells & V(l,l) cy 



Error of FMG vs. FMG-SR, w/ 4 halo cells & V(2,2) cy 
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Figure 7: Convergence study with four halo cells in subdomains solver; V(l,l) cycle (left); 
V(2,2) cycle (right) 



shows that with four halo cells the accuracy is very good with all solver configurations. 

For reference, Figure [8] shows convergence studies using a simple point-wise Gauss- 
Seidel smoother (ie, the subdomain solver applied to the entire grid). This data shows 
that the convergence results that we get, on this test problem, when using a standard 
(multiplicative) smoother are a bit "cleaner" than that of the Jacobi smoothers in Figures 
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FMG vs. FMG-SR, w/ global smoother k V(l.l) Error of FMG vs. FMG-SR, w/ global smoother k V(2.2} 
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Figure 8: Convergence study with point-wise Gauss-Seidel smoothers; V(l,l) cycle (left); 
V(2,2) cycle (right) 



El and El 

5 Conclusions 

We have developed mathematical understanding of a highly concurrent FMG-FAS multi- 
grid algorithm based on the r-correction and segmented refinement approach. The method 
has the advantage of possessing very high levels of concurrency and is highly asynchronous. 
This method also posses good data reuse properties because processing is confined to 
patches where operations can be "fused" , obviating the need to even store the entire solu- 
tion at any one time. We use overlapping subdomains which allows for accurate subdomain 
solves in the smoothers without communication. These subdomain solves can be relatively 
accurate because the data is local (eg, in cache or fast memory of some sort) with low 
memory movement cost. Interesting areas of future research are applying these methods to 
higher order discretizations, systems of PDEs, transient problems and hyperbolic problems, 
in a parallel and in an asynchronous environment. 
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6 Appendix 

The data shown in this paper is generated with a Matlab code, shown here. Run "conv(12)" 
to do a convergence study with 12 levels (4096 cells on the finest level). The coarsest level 
is at level 2, so there are actually nine multigrid levels when ten levels are requested. 

function conv( M ) 

°/o conv: Run convergence test for fmg() 
% Plot errors vs. N. 
close all 

set(0, 'Def aultFigureWindowStyle' , 'docked') 
Iw = 1.5; fz = 18; 
s = 4; s_N = 2~s; s_h = l/s_N; 
nmodes = 16; 
for halo_type=l : 3 
for ns=l:2 

qq = s_h~2; 
for k = s:M 

err_noRS(k)= fmg(k,0,2,0,ns,halo_type) ; 
err_RSl(k) = fmg(k, 1 ,2 ,0,ns ,halo_type) ; 
err_RS2(k) = fmg(k,2,2,0,ns,halo_type) ; 
err_RS3(k) = fnig(k,3,2,0,ns,halo_type) ; 
quad(k) = qq; 
qq = qq/4; 

end 

figure 

p=s :M; 

p2 = 2.-p; 

loglog(p2, err_noRS(4:M) , 'kd — ' , 'linewidth' ,lw) , hold on 
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loglog(p2, err_RSl(4:M) , 'bx- . ' , 'linewidth' ,lw) , hold on 
loglog(p2, err_RS2(4:M) , 'ro :', 'linewidth' ,lw) , hold on 
loglog(p2, err_RS3(4:M) , 'g*-' , 'linewidth' ,lw) , hold on 
loglog(p2, quad(4:M), 'm- ',' linewidth' , Iw) , hold on 
setCgca, 'XTick' ,p2) 
V = axis; 

V(l) = s_N; V(2) = 2"M; 
axis (V) 

legendCFMG' , 'FMG-SR 1-grid' , 'FMG-SR 2-grids' , 'FMG-SR 3-grids ' , ' quadradic ' ) 
ylabelClu - ~u| _2/|u| _2' , 'f ontsize' ,fz) ; 
xlabeK'N cells' , 'fontsize' ,fz) ; 
if halo_type==3, 

titleC ['Error of FMG vs. FMG-SR, w/ global smoother \& V( ' ,num2str (ns) , ' , ' ,nuin2 
grid 

print( gcf, '-djpeglOO', ['conv_global_bc_' ,nuin2str(ns) , 'smooth'] ) 
printC gcf, '-depsc', ['conv_global_bc_' ,num2str(ns) , 'smooth'] ) 

else 

title(['Error of FMG vs. FMG-SR, w/ ' ,num2str(2*halo_type) , ' halo \& V(',num2st 

grid 

printC gcf, '-djpeglOO', ['conv_' ,num2str(2*halo_type) , 'bc_' ,niim2str(ns) , 'smoot 
printC gcf, '-depsc', ['conv_' ,num2str(2*halo_type) , 'bc_' ,nijm2str(ns) , 'smooth'] 

end 

end 

end 

function [ error ] = fmg( M, nRS, MO, pflag, ns, halo_type, nmodes ) 

°/o fmg: fmg-fas with segmetnted refinement. 

% M - number of coarse grids . 

% MO - level of coarsest grid. 

if nargin < 4, pflag = 1; end 

if nargin < 3, MO = 2; end 

if nargin < 2, nRS = 0; end 

if nargin < 1 , M = 5 ; end 

NN = 2~M, h_M = 1/NN; 

if nargin < 7, nmodes = NN/16; end 

if nargin < 5 , ns = 1 ; end 

if nargin < 6, halo_type = 2; end 

set(0, 'Def aultFigureWindowStyle' , 'docked') 

uu = cell(l,M) ; 
rhs_orig = cell(l,M); 
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N = NN; 

for k=M:-l:MO 

M_lev(k) = N; N = N/2; 
end 
I 

[ L rhs ext Prol Rest RRt Prol_FMG ] = getopsC M, nRS, MO, nmodes ); 

7o 

y. FAS-FMG-SR w/ tau correction 
% 

prt = 0; 

7o FMG up to finest grid M 

uu{MO> = smoothC L{MO}, 0, rhs{MO}, 1, MO ); % coarsest grid solve 

if prt, coarse_smoothing_error_infnorm = [MO N_lev(MO) norm(uu{MO}-ext{MO},inf ) norm(rhs{MO 
for k=MO:M-l 

uu{k+l} = Prol_FMG{k> * uu{k>; % FMG prol. 

% presmooth fines grid at this level 

if prt, pre_v_cycle_err_res_inf = [k+1 N_lev(k+1) nonii(uu{k+l}-ext{k+i} , inf ) norm(rhs{k 
rlis_orig{k+l} = rhs{k+l}; 

uu{k+l} = smoothC L-[k+l>, uu-[k+l}, rhs{k+l}, ns, MO, 0, 0, 0, halo_type ); 
% pre smoothing + coarse grid 
for m=k:-l:MO 

uu{m} = Rest{m}*uu{m+1}; % initial guess for coarse grid 

rhs{m} = Rest{m}*rhs{m+1} + L{ni}*uu{m} - Rest{m}*L{m+l}*uu{m+l} ; 
uu{ni} = smoothC L{m}, uu-[m]-, rhs{m}, ns, MO, 0, 0, 0, halo_type ); 

"/oif prt, pre_smoothing_error_infnorm = [m N_lev(m) norm (uu{m}-ext{m}, inf ) norm(rhs{ 

end 

'/o post smoothing 
for m=MO:k 

if m < M-nRS, 

uu{m+l} = uu{m+l} + Prol{m}* (uu{m} - Rest{m}*uu{m+1}) ; 

uu{m+l} = smoothC L{m+1}, uu{m+l}, rhs{m+l}, ns, MO, 0, 0, 0, halo_type ); 

else 

uu{m+l} = Prol_FMG{m>*uu{m> ; 

ns2 = ns; %[ 2*CM-m) N_levCm+l) ], % Cm-M+5) 

uu{m+l} = smoothC L{m+1>, uu{m+l>, rhs{m+l}, ns2Cl) , MO, uu{m}, diagCRRt{m}) , R 
°/oUu{m+l} = smoothC L{m+1>, uu{m+l}, rhs_orig{m+l} , ns2Cl), MO, uu{m>, diagCRRt{ 

end 

%if prt, post_smoothing_error_infnorm = [m+1 N_levCm+l) normCuu{m+l}~ext{m+i} , inf ) 

end 

if prt, post_v_cycle_err_res_inf = [k+1 N_levCk+l) norm Cuu{k+l}-ext {k+1} , inf ) normCrhs{ 

end 
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%res_red = norm(rhs{M}-L{M}*uu{M})/norm(rhs{M}) , 
7oerr_red = norm (uu{M}-ext{M}) /norm (ext{M}) , 
%f igure 

y.plot(rhs{M>-L{M>*uu{M>, 'b:*') , hold on, 

%plot(rhs{M},'r:o'), hold on, 

%pause 

% plot & error 
if pflag, 

close all 

figure 

XX = h_M/2 + h_M*(0:NN-l) ; 

plotCxx, uu{M>, 'r* — '), hold on 

plot(xx, rhs{M}, 'go — '), hold on 

plot(xx, ext{M}, 'bx-0, hold on 

plot(xx, abs(ext{M>-uu{M>) , 'md-'), hold on 

axis([0 1 1. l*max(uu{M})] ) 

legend ( ' result ' , ' b ' , ' x ' , ' error ' ) 

end 

error = norm(ext-[M}— uu{M},2)/norm(ext{M},2) ; 
end 

function [ L rhs ext Prol Rest RRt Prol_FMG ] = getopsC M, nRS, MO, nmodes ) 
% getops: create opertors for FMG. 

% ID 2nd order finite volume discretization of Laplacian with Dirichlet 
°/o boundary conditions. First order restriction, 2d oreder prolongation 
/o and 4th order FMG interpolation. 
NN = 2"M; h_M = 1/NN; 

I 

°/o Form restriction and prolongation ops 
1 

Prol = cell(l,M-l) ; 
Rest = cell(l,M-l); 
RRt = cell(l,M-l); 
n=2~M0; m=2*n; 
for k=MO:M-i 

P = zeros (m,n); PO = zeros (m,n); 

P(2,l) = 3; P(m-l,n) = 3; 

P(l,l) = 2; P(m,n) = 2; 

P0(2,l) = 1; PO(m-l,n) = 1; 
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P0(1,1) = 1; PO(m,n) = 1; 
if m > 2, 

P(m-2,n) = 1; P(3,l) = 1; 

end 

for j=2:n-l 

jj = (j-2)*2 + 2; 
PP = jj:jj+3; 
P(pp,j) = [13 3 1]; 
PP = jj+l:jj+2; 
PO(pp,j) =[11]; 

end 

Prol{k} = sparse (0. 25*P) ; 
7.Rest{k} = 0.125*P' ; 
Rest{k} = sparse(0.5*P0') ; 
RRt{k} = Rest{k}*Rest{k}' ; 
m = m*2; n = n*2; 
end 

y. 

y. Form L & Prol_H 

y. 

L = celKl.M); 
Prol_FMG = cell(l,M-l); 
N = NN; h = h_M; 
for k=M:-l:MO 

A = 2*eye(N) - diag(ones(N-l , 1) , 1) - diag(ones(N-l, 1) ,-1) ; 
A(l,l) = 3; 

A(N,N) = 3; 

L{k> = sparse (A* (1/h) ~2) ; 
if k > MO, 

y.Prol_H{k-l} = Prol{k-l}; 

P = zeros(N,N/2) ; 

P(l,l) = 70; P(l,2) = -2; 

P(2,l) = 112; P(2,2) = 35; P(2,3) = -5; 

P(3,l) = 40; P(3,2) = 105; P(3,3) = -7; 

P(4,l) = -7; P(4,2) = 105; P(4,3) = 35; P(4,4) = -5; 

y. 

for i=5:2:N-4 

j = (i-l)/2 + 1; 

P(i,j-2) = -5; P(i,j-1) = 35; PCiJ) = 105; P(i,j + 1) = -7; 

P(i+l,j-l) = -7; P(i+l,j) = 105; P(i+l,j+l) = 35; P(i+l,j+2) = -5; 

end 
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% 

j = N/2; 

P(N,j) = 70; P(N,j-l) = -2; 
P(N-l,j) = 112; P(N-l,j-l) = 35; P(N-l,j-2) = -5; 
P(N-2,j) = 40; P(N-2,j-l) = 105; P(N-2,j-2) = -7; 

P(N-3,j) = -7; P(N-3,j-l) = 105; P(N-3,j-2) = 35; P(N-3,j-3) = -5; 

7o 

Prol_FMG{k-l} = (1/128) *sparse (P) ; 

end 

N = N/2; h = h * 2; 
end 

y. 

% Form f 
I 

rhs = cell(l,M) ; 

ext = celKl.M) ; 

N = NN; h = h_M; 

X = h/2 + h*(0:N-l) ; 

rhs{M> = 0*x'; ext{M} = 0*x' ; 

for j=l : 2 :nmodes , 

rhs{M} = rhs{M} + (l/j)*sin(j*pi*x) ' ; 

ext{M} = ext{M} + (l/j)*sin(j*pi*x) V(j*pi) "2; 

end 

f = rhs{M}; 
e = ext{M}; 
"/.figure 

y,plot(x, (ext{M> - L{M}\rhs{M}) ./ext{M}, 'o— ') , hold on 
yoplot(x,ext{M>, 'or — '), hold on 
for k=M-l:-l:MO 

f = Rest{k} * f ; 

rhs{k> = f; 

e = Rest{k} * e; 

ext{k} = e; 

%N = N/2; h = h * 2; 

y.x = h/2 + h*(0:N-l) ; 

y,plot(x, (ext{k} - L{k>\rhs{k» ./ext {k>, '*-.') , hold on 
yoplot(x,ext{k}, '*k-. ') , hold on 

end 

XprintC gcf, '-djpeglOO', 'disc_error' ), grid, pause 
end 
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function u_out = smooth( L, u_0, f , ns, MO, u_H, RRt, R, halo_type, a_pl, a_p2 ) 
% smooth: SR smoother 

7o 

[N x] = size(L) ; 
omega = 1 . ; 
sqrt2i = l/sqrt(2) ; 
if N == 2~M0, 

% coarse grid solve 

u_out = L \ f; 

else 

if nargin < 10, 

°/o whole level, additive Schwarz 
if halo_type==l , 

u_out = zeros(N, 1); 
pi = 1:2; 
p2 = 1:4; 

u_out(pl) = smoothC L, u_0, f, ns, MO, u_H, RRt, R, halo_type, pi, p2 ); 
for ii=6:2:N, 

pi = (ii-3):(ii-2); 

p2 = (ii-5) : ii ; 

u_out(pl) = smoothC L, u_0, f, ns, MO, u_H, RRt, R, halo_type, pi, p2 ); 

end 

pi = (N-1):N; 
p2 = (N-3) :N; 

u_out(pl) = smoothC L, u_0, f, ns, MO, u_H, RRt, R, halo_type, pi, p2 ); 
elseif halo_type==2 , 

u_out = zeros(N, 1); 
pi = 1:2; 
p2 = 1:6; 

u_out(pl) = smoothC L, u_0, f, ns, MO, u_H, RRt, R, halo_type, pi, p2 ); 

pi = 3:4; 
p2 = 1:8; 

u_outCpl) = smoothC L, u_0, f, ns, MO, u_H, RRt, R, halo_type, pi, p2 ); 
for ii=10:2:N, 

pi = Cii-5) : Cii-4); 

p2 = Cii-9) :ii; 

u_outCpl) = smoothC L, u_0, f, ns, MO, u_H, RRt, R, halo_type, pi, p2 ); 

end 

pi = CN-3):CN-2); 
p2 = CN-7) :N; 

u_outCpl) = smoothC L, u_0, f, ns, MO, u_H, RRt, R, halo_type, pi, p2 ); 



20 



pi = (N-1):N; 
p2 = (N-5) :N; 

u_out(pl) = smooth ( L, u_0, f, ns, MO, u_H, RRt, R, halo_type, pi, p2 ); 

else 

1 whole level, G-S 
u_out = zeros(N, 1); 

pi = 1:N; p2 = pi; 

u_out(pl) = smooth ( L, u_0, f, ns, MO, u_H, RRt, R, halo_type, pi, p2 ); 

end 

else 

"/o one subdomain 
for k = l:ns, 

% distributed (Kaczmarz) relaxation 
[nl n2] = size (RRt) ; 
if nl+n2 > 2, 

°/o setup coarse grid iterator 

if a_p2(l) < a_p2(2), inc = 2; off = 0; else inc = -2; off = 1; end 

[x n] = size(a_p2); 

% distributed (Kacmarz) relaxation 

for ii=(a_p2(l) :inc:a_p2(n)) - off 

jj = (ii-l)/2 +1; 7. H index 

r = u_H(jj) - R(jj,:)*u_0; 

t = r / RRt(jj); % update for RR' y = x~H 

u_0 = u_0 + R(jj,:)' * t; % update x'h 

end 

"/.error = [k size(a_p2) norm(u_H - R*u_0)] 

end 

% normal smoothing 
for ii=a_p2 

u_0(ii) = u_0(ii) + omega * (f(ii) - L(ii,:)*u_0) / L(ii,ii); 

end 

% symmetrize 

a_p2 = fliplr(a_p2) ; 

end 

u_out = u_0(a_pl); 

end 

end 
end 
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